{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dceb7cd6-56a9-4e96-acd8-fc0c558c54e3",
   "metadata": {},
   "source": [
    "# Understanding Random Forests using Python (scikit-learn)\n",
    "#### A Random Forest is a powerful machine learning algorithm that can be used for classification and regression, is interpretable, and doesn’t require feature scaling. Here’s how to apply it. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "588f75a4-6498-4593-a8f2-6cd238d3a132",
   "metadata": {},
   "source": [
    "[Decision trees](https://medium.com/data-science/understanding-decision-trees-for-classification-python-9663d683c952) are a popular supervised learning algorithm with benefits that include being able to be used for both regression and classification as well as being easy to interpret. However, decision trees aren’t the most performant algorithm and are prone to overfitting due to small variations in the training data. This can result in a completely different tree. This is why people often turn to ensemble models like Bagged Trees and Random Forests. These consist of multiple decision trees trained on bootstrapped data and aggregated to achieve better predictive performance than any single tree could offer. This tutorial includes the following:  \n",
    "- What is Bagging\n",
    "- What Makes Random Forests Different\n",
    "- Training and Tuning a Random Forest using Scikit-Learn\n",
    "- Calculating and Interpreting Feature Importance\n",
    "- Visualizing Individual Decision Trees in a Random Forest\n",
    "\n",
    "\n",
    "As always, the code used in this tutorial is available on my [GitHub](https://github.com/mGalarnyk/Python_Tutorials/blob/master/Sklearn/CART/Random_Forest/RandomForestUsingPython.ipynb). A [video version](https://youtu.be/R9tJeEgHyeo) of this tutorial is also available on my YouTube channel for those who prefer to follow along visually. With that, let’s get started!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "747094f2-cb3c-4598-b543-ae94fc25210b",
   "metadata": {},
   "source": [
    "## What is Bagging (Bootstrap Aggregating)\n",
    "![](../images/bagging.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "342f6229-5fe3-4954-89c3-aa6e4c6f1b96",
   "metadata": {},
   "source": [
    "Bagged trees and random forests can be further categorized as bagging algorithms (<b>b</b>ootstrap <b>agg</b>regat<b>ing</b>). Bagging consists of two steps:\n",
    "\n",
    "1.) Bootstrap sampling: Create multiple training sets by randomly drawing samples with replacement from the original dataset. These new training sets, called bootstrapped datasets, typically contain the same number of rows as the original dataset, but individual rows may appear multiple times or not at all. On average, each bootstrapped dataset contains about 63.2% of the unique rows from the original data. The remaining ~36.8% of rows are left out and can be used for out-of-bag (OOB) evaluation. For more on this concept, see my [sampling with and without replacement blog post](https://towardsdatascience.com/understanding-sampling-with-and-without-replacement-python-7aff8f47ebe4/).\n",
    "\n",
    "2.) Aggregating predictions: Each bootstrapped dataset is used to train a different decision tree model. The final prediction is made by combining the outputs of all individual trees. For classification, this is typically done through majority voting. For regression, predictions are averaged.\n",
    "\n",
    "Training each tree on a different bootstrapped sample introduces variation across trees. While this doesn't fully eliminate correlation—especially when certain features dominate—it helps reduce overfitting when combined with aggregation. Averaging the predictions of many such trees reduces the overall variance of the ensemble, improving generalization."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de092e2f-66ab-4392-9f15-ef827638c8de",
   "metadata": {},
   "source": [
    "### What Makes Random Forests Different\n",
    "![](../images/BaggedVsRandomForests.png)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "89e4a60b-7d20-4a28-88fe-1d2703e7e77f",
   "metadata": {},
   "source": [
    "Suppose there’s a single strong feature in your dataset. In [bagged trees](https://youtu.be/urb2wRxnGz4?si=voTNstvcYQMLdlNJ), each tree may repeatedly split on that feature, leading to correlated trees and less benefit from aggregation. Random Forests reduce this issue by introducing further randomness. Specifically, they change how splits are selected during training:\n",
    "\n",
    "1). Create N bootstrapped datasets. Note that while bootstrapping is commonly used in Random Forests, it is not strictly necessary because step 2 (random feature selection) introduces sufficient diversity among the trees.\n",
    "\n",
    "2). For each tree, at each node, a random subset of features is selected as candidates, and the best split is chosen from that subset. In scikit-learn, this is controlled by the max_features parameter, which defaults to 'sqrt' for classifiers and 1 for regressors (equivalent to bagged trees).\n",
    "\n",
    "3). Aggregating predictions: vote for classification and average for regression.\n",
    "\n",
    "Note: Random Forests use [sampling with replacement (shown below) for bootstrapped datasets and sampling without replacement](https://towardsdatascience.com/understanding-sampling-with-and-without-replacement-python-7aff8f47ebe4/) for selecting a subset of features.\n",
    "\n",
    "![](../images/TOCSampleWithReplacement.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42c08ee2-996c-46f4-8cbf-ad43e6102107",
   "metadata": {},
   "source": [
    "### Out-of-Bag (OOB) Score\n",
    "\n",
    "Because ~36.8% of training data is excluded from any given tree, you can use this holdout portion to evaluate that tree's predictions. Scikit-learn allows this via the oob_score=True parameter, providing an efficient way to estimate generalization error. You'll see this parameter used in the training example later in the tutorial."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57b4b47c-a6a7-4976-93bc-6a4a9c2d6b46",
   "metadata": {},
   "source": [
    "## Training and Tuning a Random Forest in Scikit-Learn\n",
    "\n",
    "Random Forests remain a strong baseline for tabular data thanks to their simplicity, interpretability, and ability to [parallelize](https://www.anyscale.com/blog/how-to-speed-up-scikit-learn-model-training) since each tree is trained independently. This section demonstrates how to load data, [perform a train test split](https://youtu.be/rCevxk3jeKs?si=SCzxap0-l3vBSrvM), train a baseline model, tune hyperparameters using grid search, and evaluate the final model on the test set."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4e20582-7199-4673-9b5a-b9750aa61b4a",
   "metadata": {},
   "source": [
    "### Step 1: Train a Baseline Model\n",
    "Before tuning, it's good practice to train a baseline model using reasonable defaults. This gives you an initial sense of performance and lets you validate generalization using the out-of-bag (OOB) score, which is built into bagging-based models like Random Forests. This approach allows us to reserve the test set for final evaluation after tuning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "dbfa58bc-65af-4350-951f-d0f61b81014e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "# Some imports are only used later in the tutorial\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.datasets import load_breast_cancer \n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.ensemble import RandomForestRegressor \n",
    "from sklearn.inspection import permutation_importance\n",
    "from sklearn.model_selection import GridSearchCV, train_test_split\n",
    "from sklearn import tree\n",
    "\n",
    "# Load dataset\n",
    "url = 'https://raw.githubusercontent.com/mGalarnyk/Tutorial_Data/master/King_County/kingCountyHouseData.csv'\n",
    "df = pd.read_csv(url)\n",
    "\n",
    "columns = ['bedrooms',\n",
    "            'bathrooms',\n",
    "            'sqft_living',\n",
    "            'sqft_lot',\n",
    "             'floors',\n",
    "             'waterfront',\n",
    "             'view',\n",
    "             'condition',\n",
    "             'grade',\n",
    "             'sqft_above',\n",
    "             'sqft_basement',\n",
    "             'yr_built',\n",
    "             'yr_renovated',\n",
    "             'lat',\n",
    "             'long',\n",
    "             'sqft_living15',\n",
    "             'sqft_lot15',\n",
    "             'price']\n",
    "\n",
    "df = df[columns]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "71806480-03f1-450c-ba46-4041be613e8f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Baseline OOB score: 0.861\n"
     ]
    }
   ],
   "source": [
    "# Define features and target\n",
    "X = df.drop(columns='price')\n",
    "y = df['price']\n",
    "\n",
    "# Train/test split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n",
    "\n",
    "# Train baseline Random Forest\n",
    "reg = RandomForestRegressor(\n",
    "    n_estimators=100,        # number of trees\n",
    "    max_features=1/3,        # fraction of features considered at each split\n",
    "    oob_score=True,          # enables out-of-bag evaluation\n",
    "    random_state=0\n",
    ")\n",
    "\n",
    "reg.fit(X_train, y_train)\n",
    "\n",
    "# Evaluate baseline performance using OOB score\n",
    "print(f\"Baseline OOB score: {reg.oob_score_:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "abbcce20-e848-474f-96b8-f3e3e536988b",
   "metadata": {},
   "source": [
    "### Step 2: Tune Hyperparameters with Grid Search\n",
    "While the baseline model gives a strong starting point, performance can often be improved by tuning key hyperparameters. Grid search cross-validation, as implemented by GridSearchCV, systematically explores combinations of hyperparameters and uses cross-validation to evaluate each one, selecting the configuration with the highest validation performance.The most commonly tuned hyperparameters include:\n",
    "- n_estimators: The number of decision trees in the forest. More trees can improve accuracy but increase training time.\n",
    "- max_features: The number of features to consider when looking for the best split. Lower values reduce correlation between trees.\n",
    "- max_depth: The maximum depth of each tree. Shallower trees are faster but may underfit.\n",
    "- min_samples_split: The minimum number of samples required to split an internal node. Higher values can reduce overfitting.\n",
    "- min_samples_leaf: The minimum number of samples required to be at a leaf node. Helps control tree size.\n",
    "- bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5357b237-e9a4-4af1-b164-5b3752529c4e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best parameters: {'max_depth': 20, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}\n",
      "Best R^2 score: 0.862\n"
     ]
    }
   ],
   "source": [
    "param_grid = {\n",
    "    'n_estimators': [100],\n",
    "    'max_features': ['sqrt', 'log2', None],\n",
    "    'max_depth': [None, 5, 10, 20],\n",
    "    'min_samples_split': [2, 5],\n",
    "    'min_samples_leaf': [1, 2]\n",
    "}\n",
    "\n",
    "# Initialize model\n",
    "rf = RandomForestRegressor(random_state=0, oob_score=True)\n",
    "\n",
    "grid_search = GridSearchCV(\n",
    "    estimator=rf,\n",
    "    param_grid=param_grid,\n",
    "    cv=5,             # 5-fold cross-validation\n",
    "    scoring='r2',     # evaluation metric\n",
    "    n_jobs=-1         # use all available CPU cores\n",
    ")\n",
    "\n",
    "grid_search.fit(X_train, y_train)\n",
    "\n",
    "print(f\"Best parameters: {grid_search.best_params_}\")\n",
    "print(f\"Best R^2 score: {grid_search.best_score_:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10f5e2e4-6057-4e37-9b5f-3c9fc72ce2f7",
   "metadata": {},
   "source": [
    "### Step 3: Evaluate Final Model on Test Set\n",
    "Now that we’ve selected the best-performing model based on cross-validation, we can evaluate it on the held-out test set to estimate its generalization performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "741bad7e-e963-4ff7-978c-7351c742693e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test R^2 score (final model): 0.889\n"
     ]
    }
   ],
   "source": [
    "# Evaluate final model on test set\n",
    "best_model = grid_search.best_estimator_\n",
    "print(f\"Test R^2 score (final model): {best_model.score(X_test, y_test):.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d54f6a28-c5e6-40f3-8533-3d64e3454d20",
   "metadata": {},
   "source": [
    "## Calculating Random Forest Feature Importance\n",
    "One of the key advantages of Random Forests is their interpretability — something that large language models (LLMs) often lack. While LLMs are powerful, they typically function as black boxes and can [exhibit biases that are difficult to identify](https://youtu.be/2v18R02mq8I?si=oeJadtZT3ytFmTE8). In contrast, scikit-learn supports two main methods for measuring feature importance in Random Forests: Mean Decrease in Impurity and Permutation Importance.\n",
    "\n",
    "1). Mean Decrease in Impurity (MDI): Also known as Gini importance, this method calculates the total reduction in impurity brought by each feature across all trees. This is fast and built into the model via ```reg.feature_importances_```. However, impurity-based feature importances can be misleading, especially for features with high cardinality (many unique values), as these features are more likely to be chosen simply because they provide more potential split points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2c867e8-4332-4ae6-a3c8-711abcd310df",
   "metadata": {},
   "outputs": [],
   "source": [
    "importances = reg.feature_importances_\n",
    "feature_names = X.columns\n",
    "sorted_idx = np.argsort(importances)[::-1]\n",
    "for i in sorted_idx:\n",
    "    print(f\"{feature_names[i]}: {importances[i]:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de44ef51-3845-42c0-8c1a-c9b8cb71290d",
   "metadata": {},
   "source": [
    "2). Permutation Importance: This method assesses the decrease in model performance when a single feature’s values are randomly shuffled. Unlike MDI, it accounts for feature interactions and correlation. It is more reliable but also more computationally expensive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ede19b2-3fbe-4d85-9c1f-161617300416",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform permutation importance on the test set\n",
    "perm_importance = permutation_importance(reg, X_test, y_test, n_repeats=10, random_state=0)\n",
    "sorted_idx = perm_importance.importances_mean.argsort()[::-1]\n",
    "for i in sorted_idx:\n",
    "    print(f\"{X.columns[i]}: {perm_importance.importances_mean[i]:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd6b0445-626a-48db-98fb-6ad87d518e0d",
   "metadata": {},
   "source": [
    "It is important to note that incorporating geographic features into your model can improve performance and realism. In our case, lat and long are useful as the plot below shows. It’s likely that companies like Zillow leverage location information extensively in their valuation models."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c7650da1-535a-4cca-b8c4-167e19fe4e4b",
   "metadata": {},
   "source": [
    " ![](../images/KingCountyHousingPrices.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a02a2c2-796e-460b-9c5a-fa63a77f4b7a",
   "metadata": {},
   "source": [
    "## Visualizing Individual Decision Trees in a Random Forest\n",
    "\n",
    "A Random Forest consists of multiple decision trees—one for each estimator specified via the n_estimators parameter. After training the model, you can access these individual trees through the .estimators_ attribute. Visualizing a few of these trees can help illustrate how differently each one splits the data due to bootstrapped training samples and random feature selection at each split. While the earlier example used a RandomForestRegressor, here we demonstrate this visualization using a RandomForestClassifier trained on the Breast Cancer Wisconsin dataset to highlight Random Forests' versatility for both regression and classification tasks. [This short video](https://www.youtube.com/embed/X8UeOrsUKQ4) demonstrates what 100 trained estimators from this dataset look like."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8fbe975a-e5b9-4603-bf89-f0b186327d2a",
   "metadata": {},
   "source": [
    "### Fit a Random Forest Model using Scikit-Learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5c1cad10-f591-4a5d-86fa-a10cfe7b444d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the Breast Cancer (Diagnostic) Dataset\n",
    "data = load_breast_cancer()\n",
    "df = pd.DataFrame(data.data, columns=data.feature_names)\n",
    "df['target'] = data.target\n",
    "\n",
    "# Arrange Data into Features Matrix and Target Vector\n",
    "X = df.loc[:, df.columns != 'target']\n",
    "y = df.loc[:, 'target'].values\n",
    "\n",
    "# Split the data into training and testing sets\n",
    "X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)\n",
    "\n",
    "# Random Forests in `scikit-learn` (with N = 100)\n",
    "rf = RandomForestClassifier(n_estimators=100,\n",
    "                            random_state=0)\n",
    "rf.fit(X_train, Y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f0ec5b90-965d-4277-8ed0-aa3204816a88",
   "metadata": {},
   "source": [
    "### Plotting Individual Estimators (decision trees) from a Random Forest using Matplotlib\n",
    "\n",
    "You can now view all the individual trees from the fitted model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1069e5e-43fd-4e6d-a514-f2dfa406fecd",
   "metadata": {},
   "outputs": [],
   "source": [
    "rf.estimators_"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "626b5153-4c38-4f66-a557-35ca84d84151",
   "metadata": {},
   "source": [
    "You can now visualize individual trees. The code below visualizes the first decision tree. It is important to note that individual decision trees in a Random Forest are typically grown deeper than standalone decision trees, leading to higher variance that is later reduced by aggregation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ac1d7b3-d24d-4734-bf1d-5cc23db5afac",
   "metadata": {},
   "outputs": [],
   "source": [
    "fn=data.feature_names\n",
    "cn=data.target_names\n",
    "fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (4,4), dpi=800)\n",
    "tree.plot_tree(rf.estimators_[0],\n",
    "               feature_names = fn, \n",
    "               class_names=cn,\n",
    "               filled = True);\n",
    "fig.savefig('rf_individualtree.png')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c769e177-bc52-47ca-9341-ec351f953db8",
   "metadata": {},
   "source": [
    "Although plotting many trees can be difficult to interpret, you may wish to explore the variety across estimators. The following example shows how to visualize the first five decision trees in the forest:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0e6123c-a5a7-4969-a785-70ffe9bd778b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This may not the best way to view each estimator as it is small\n",
    "fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(10, 2), dpi=3000)\n",
    "\n",
    "for index in range(5):\n",
    "    tree.plot_tree(rf.estimators_[index],\n",
    "                   feature_names=fn,\n",
    "                   class_names=cn,\n",
    "                   filled=True,\n",
    "                   ax=axes[index])\n",
    "    axes[index].set_title(f'Estimator: {index}', fontsize=11)\n",
    "\n",
    "fig.savefig('rf_5trees.png')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c890eec9-6ccb-4ad4-8e45-cd397444d23d",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "Random forests consist of multiple decision trees trained on bootstrapped data in order to achieve better predictive performance than could be obtained from any of the individual decision trees. If you have questions or thoughts on the tutorial, feel free to reach out through [YouTube](https://youtu.be/R9tJeEgHyeo) or [X](https://twitter.com/GalarnykMichael)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
